SECOM Case Study

Problem Statement

A sample machine learning data set of semi-conductor quality data was provided for supervised learning. The objective was to identify the signals that are important in determining pass/fail yield in a manufacturing process.

More information regarding the problem statement is provided here.

Extract, Transform, and Load

A Python 3.7 backend was developed to:

  • Pull the data from its sources
  • Prepare for a load into Sqlite
  • Commit records and report upload status

In a production environment:

  • Use an enterprise warehouse solution such as Hadoop, Teradata, Oracle, SQL Server, or MySQL
  • Schedule the script in Unix via /etc/crontab
#!/usr/bin/python3
import urllib.request
import pandas as pd
import io
import sqlite3
from datetime import datetime
from tqdm import tqdm
dataUrl = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
labelUrl = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data"
vendorUrl = "./data/vendordata.json"
# column names for sqlite
newCols = {
    "datetime": "MFG_DATE",
    "mat vendor": "MAT_VENDOR",
    "part vendor": "PART_VENDOR",
    "sil vendor": "SIL_VENDOR",
    "adhs vendor": "ADHS_VENDOR",
    "sop vendor": "SOP_VENDOR",
}
def dfUpload(df, con, table, timeStamp=True, clearTable=False, debug=False):
    if timeStamp:
        df['INSERTED_ON'] = datetime.now()
    df = df.where(pd.notnull(df), None)  # convert NaN to None, for SQL Nulls
    # format datetime columns as strings so pd.NaT also inserts as NULL
    for col in df.columns:
        if df[col].dtype.kind == 'M':
            df[col] = df[col].dt.strftime('%Y-%m-%d %H:%M:%S')
            df[col] = df[col].astype(object).where(df[col].notnull(), None)
    sqlColumns = '(' + ','.join(df.columns) + ')'
    sqlValues = '(' + ','.join(':' + str(i + 1) for i in range(len(df.columns))) + ')'
    sqlInsert = "INSERT INTO %s %s VALUES %s" % (table, sqlColumns, sqlValues)
    crsr = con.cursor()
    # uploading
    if clearTable:
        crsr.execute("DELETE FROM %s" % table)
    for row in tqdm(df.values.tolist(), desc="Uploading data", unit="row"):
        if debug:
            try:
                crsr.execute(sqlInsert, row)
            except sqlite3.Error:
                print(row)  # show the offending row, then continue
        else:
            crsr.execute(sqlInsert, row)
    con.commit()
    crsr.close()
def main():
    # tried pd.read_html(), but no tables found?
    def PandasFromUrl(url):
        return pd.read_csv(io.BytesIO(urllib.request.urlopen(url).read()),
                           encoding="utf8", sep=" ", header=None)
    print("Fetching data from web and formatting...")
    data = PandasFromUrl(dataUrl)
    data.columns = ["F" + str(i) for i in range(len(data.columns))]  # prefix feature columns with "F"
    data['PASS_FAIL'] = PandasFromUrl(labelUrl)[0]
    vendors = pd.read_json(vendorUrl).sort_index()
    df = data.merge(vendors, left_index=True, right_index=True)
    df.rename(index=str, columns=newCols, inplace=True)
    df['ID'] = list(range(len(df)))
    print("Connecting to Sqlite...")
    con = sqlite3.connect("warehouse.db")
    print("Clearing table and inserting records...")
    dfUpload(df, con, "SAMPLE", clearTable=True)
    print("Disconnecting from Sqlite...")
    con.close()
    print("Done!")
    
    
if __name__ == '__main__':
    main()
Fetching data from web and formatting...
Connecting to Sqlite...
Clearing table and inserting records...
Disconnecting from Sqlite...
Done!


Uploading data:   0%|          | 0/1567 [00:00<?, ?row/s]
Uploading data:  37%|###7      | 581/1567 [00:00<00:00, 5791.97row/s]
Uploading data:  69%|######8   | 1079/1567 [00:00<00:00, 4947.52row/s]
Uploading data: 100%|##########| 1567/1567 [00:00<00:00, 4806.56row/s]
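Inside the upload loop above, `executemany` is called once per row, which forfeits its batching benefit; passing the whole row list in a single call is the idiomatic pattern. A minimal stdlib sketch, using an in-memory database and an illustrative three-column slice of the SAMPLE schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
crsr = con.cursor()
crsr.execute("CREATE TABLE SAMPLE (ID INTEGER, F0 REAL, PASS_FAIL INTEGER)")

rows = [(0, 3030.93, 1), (1, 3095.78, 1), (2, 2932.61, 0)]
sqlInsert = "INSERT INTO SAMPLE (ID, F0, PASS_FAIL) VALUES (?, ?, ?)"

crsr.executemany(sqlInsert, rows)  # one call batches every row
con.commit()

count = crsr.execute("SELECT COUNT(*) FROM SAMPLE").fetchone()[0]
```

For the ~1,567-row SECOM table this removes the per-row Python overhead, though the per-row loop above is what makes the tqdm progress reporting and debug printing possible.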

Prepare Environment

Preparing an R 3.5.2 environment for statistical analysis

  • Loading required libraries
  • Clearing cache
  • Defining a helper function
  • Deleting previous results if they exist
library(DBI)
library(dplyr)
library(broom)
library(ROCR)
library(extraTrees)
library(lubridate)
library(ggplot2)
library(plotly)

rm(list = ls())  # clear all data
try(invisible(dev.off()), silent = TRUE)  # clear all plots

# helper function to preview the first and last n columns of a wide dataframe
printWideDataFrame <- function(df, n) {
    head(df[c(1:n, (ncol(df) - n):ncol(df))])
}

if (file.exists("./docs/ROCs.pdf")) invisible(file.remove("./docs/ROCs.pdf"))
if (file.exists("./docs/ImportantVariables.pdf")) invisible(file.remove("./docs/ImportantVariables.pdf"))

Fetch from Database

Since the data has already been loaded and normalized into Sqlite, a SELECT statement can be used to pull the table into RAM.

In a production environment:

  • For ODBC/JDBC, pass connection credentials to the connection object
  • For REST API “GET”, call web service for request/response objects
# connect to db, fetch table into RAM, disconnect from db
con <- dbConnect(RSQLite::SQLite(), "warehouse.db")
df_orig <- dbGetQuery(con, "select * from sample") %>% mutate(INSERTED_ON = as.POSIXct(INSERTED_ON), 
    MFG_DATE = as.POSIXct(MFG_DATE), PASS_FAIL = ifelse(PASS_FAIL == 1, 0, 1))
dbDisconnect(con)

# preview dataframe
printWideDataFrame(df_orig, 15)
  ID         INSERTED_ON            MFG_DATE PASS_FAIL MAT_VENDOR
1  0 2019-01-18 00:18:37 2008-07-19 11:55:00         1        ddd
2  1 2019-01-18 00:18:37 2008-07-19 12:32:00         1        eee
3  2 2019-01-18 00:18:37 2008-07-19 13:17:00         0        fff
4  3 2019-01-18 00:18:37 2008-07-19 14:43:00         1        ccc
5  4 2019-01-18 00:18:37 2008-07-19 15:22:00         1        ccc
6  5 2019-01-18 00:18:37 2008-07-19 17:53:00         1        eee
  PART_VENDOR SIL_VENDOR ADHS_VENDOR SOP_VENDOR      F0      F1       F2
1         aaa        ddd         bbb        eee 3030.93 2564.00 2187.733
2         ccc        ddd         aaa        aaa 3095.78 2465.14 2230.422
3         aaa        eee         aaa        jjj 2932.61 2559.94 2186.411
4         ccc        hhh         aaa        eee 2988.72 2479.90 2199.033
5         bbb        aaa         bbb        iii 3032.24 2502.87 2233.367
6         aaa        hhh         bbb        eee 2946.25 2432.84 2233.367
         F3     F4  F5   F574   F575   F576    F577   F578   F579   F580
1 1411.1265 1.3602 100 3.0624 0.1026 1.6765 14.9509     NA     NA     NA
2 1463.6606 0.8294 100 2.0111 0.0772 1.1065 10.9003 0.0096 0.0201 0.0060
3 1698.0172 1.5102 100 4.0923 0.0640 2.0952  9.2721 0.0584 0.0484 0.0148
4  909.7926 1.3204 100 2.8971 0.0525 1.7585  8.5831 0.0202 0.0149 0.0044
5 1326.5200 1.5334 100 3.1776 0.0706 1.6597 10.9698     NA     NA     NA
6 1326.5200 1.5334 100 2.2598 0.0899 1.6679 13.7755 0.0342 0.0151 0.0052
      F581   F582   F583   F584    F585   F586   F587   F588     F589
1       NA 0.5005 0.0118 0.0035  2.3630     NA     NA     NA       NA
2 208.2045 0.5019 0.0223 0.0055  4.4447 0.0096 0.0201 0.0060 208.2045
3  82.8602 0.4958 0.0157 0.0039  3.1745 0.0584 0.0484 0.0148  82.8602
4  73.8432 0.4990 0.0103 0.0025  2.0544 0.0202 0.0149 0.0044  73.8432
5       NA 0.4800 0.4766 0.1045 99.3032 0.0202 0.0149 0.0044  73.8432
6  44.0077 0.4949 0.0189 0.0044  3.8276 0.0342 0.0151 0.0052  44.0077
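The same fetch-into-RAM step can be mirrored from Python; a minimal pandas sketch against an in-memory database (the two-column table is illustrative, not the full SAMPLE schema):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE SAMPLE (ID INTEGER, PASS_FAIL INTEGER)")
con.executemany("INSERT INTO SAMPLE VALUES (?, ?)", [(0, 1), (1, -1)])

# pull the whole table into RAM, as dbGetQuery does above
df = pd.read_sql("SELECT * FROM SAMPLE", con)

# recode labels like the R mutate: raw 1 (fail) -> 0, raw -1 (pass) -> 1
df["PASS_FAIL"] = (df["PASS_FAIL"] != 1).astype(int)
con.close()
```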

Clean Data

Using dplyr commands to:

  • Force response variable to binary
  • Add dummy variables to all strings/factors
  • Remove columns no longer required in calculations
# massage for statistics
df_stats <- df_orig %>% 
  fastDummies::dummy_cols() %>%  # add dummy variables for all string columns
  select(-c(ID, INSERTED_ON, MFG_DATE, MAT_VENDOR,
            PART_VENDOR, SIL_VENDOR, ADHS_VENDOR, SOP_VENDOR))  # drop columns

# preview dataframe
printWideDataFrame(df_stats, 15)
  PASS_FAIL      F0      F1       F2        F3     F4  F5       F6     F7
1         1 3030.93 2564.00 2187.733 1411.1265 1.3602 100  97.6133 0.1242
2         1 3095.78 2465.14 2230.422 1463.6606 0.8294 100 102.3433 0.1247
3         0 2932.61 2559.94 2186.411 1698.0172 1.5102 100  95.4878 0.1241
4         1 2988.72 2479.90 2199.033  909.7926 1.3204 100 104.2367 0.1217
5         1 3032.24 2502.87 2233.367 1326.5200 1.5334 100 100.3967 0.1235
6         1 2946.25 2432.84 2233.367 1326.5200 1.5334 100 100.3967 0.1235
      F8      F9     F10    F11      F12 F13 SIL_VENDOR_ccc SIL_VENDOR_iii
1 1.5005  0.0162 -0.0034 0.9455 202.4396   0              0              0
2 1.4966 -0.0005 -0.0148 0.9627 200.5470   0              0              0
3 1.4436  0.0041  0.0013 0.9615 202.0179   0              0              0
4 1.4882 -0.0124 -0.0033 0.9629 201.8482   0              0              0
5 1.5031 -0.0031 -0.0072 0.9569 201.9424   0              0              0
6 1.5287  0.0167  0.0055 0.9699 200.4720   0              0              0
  SIL_VENDOR_fff ADHS_VENDOR_bbb ADHS_VENDOR_aaa SOP_VENDOR_eee
1              0               1               0              1
2              0               0               1              0
3              0               0               1              0
4              0               0               1              1
5              0               1               0              0
6              0               1               0              1
  SOP_VENDOR_aaa SOP_VENDOR_jjj SOP_VENDOR_iii SOP_VENDOR_ddd
1              0              0              0              0
2              1              0              0              0
3              0              1              0              0
4              0              0              0              0
5              0              0              1              0
6              0              0              0              0
  SOP_VENDOR_ccc SOP_VENDOR_kkk SOP_VENDOR_hhh SOP_VENDOR_bbb
1              0              0              0              0
2              0              0              0              0
3              0              0              0              0
4              0              0              0              0
5              0              0              0              0
6              0              0              0              0
  SOP_VENDOR_ggg SOP_VENDOR_fff
1              0              0
2              0              0
3              0              0
4              0              0
5              0              0
6              0              0
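The `fastDummies::dummy_cols()` step has a direct pandas analogue in `get_dummies`; a small sketch with illustrative vendor values:

```python
import pandas as pd

df = pd.DataFrame({
    "PASS_FAIL": [1, 1, 0],
    "SOP_VENDOR": ["eee", "aaa", "jjj"],
})

# one 0/1 indicator column per distinct string value, prefixed like the R output
dummies = pd.get_dummies(df["SOP_VENDOR"], prefix="SOP_VENDOR").astype(int)
df_stats = pd.concat([df.drop(columns="SOP_VENDOR"), dummies], axis=1)
```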

Exploratory Data Analysis

Using JavaScript wrappers (via plotly), an HTML page can display interactive volumes and yields over the given time period.

In a production environment, the UI could be a(n):

  • Web visualization app
  • Mobile app
  • Other reporting needs for internal customers
# mfg volume vs time
df_orig %>% group_by(MFG_DATE = floor_date(MFG_DATE, "day")) %>% summarize(QTY_PASS = sum(PASS_FAIL == 
    1), QTY_FAIL = sum(PASS_FAIL == 0)) %>% plot_ly(x = ~MFG_DATE, y = ~QTY_PASS, 
    name = "Pass", type = "bar") %>% add_trace(y = ~QTY_FAIL, name = "Fail") %>% 
    layout(xaxis = list(title = "Manufacturing Date"), yaxis = list(title = "Volume"), 
        barmode = "stack", title = "Semi-Conductor Production Volume vs. Time")
# mfg volume vs time (cumulative)
df_orig %>% group_by(MFG_DATE = floor_date(MFG_DATE, "day")) %>% summarize(QTY = n(), 
    QTY_PASS = sum(PASS_FAIL == 1), QTY_FAIL = sum(PASS_FAIL == 0)) %>% mutate(QTY = cumsum(QTY), 
    QTY_PASS = cumsum(QTY_PASS), QTY_FAIL = cumsum(QTY_FAIL), PERC_PASS = QTY_PASS/QTY) %>% 
    plot_ly() %>% add_trace(x = ~MFG_DATE, y = ~QTY_PASS, name = "Pass", type = "bar", 
    yaxis = "y1") %>% add_trace(x = ~MFG_DATE, y = ~QTY_FAIL, name = "Fail", 
    type = "bar", yaxis = "y1") %>% add_trace(x = ~MFG_DATE, y = ~PERC_PASS, 
    type = "scatter", mode = "lines", name = "Yield %", yaxis = "y2") %>% layout(xaxis = list(title = "Manufacturing Date"), 
    yaxis = list(side = "left", title = "Volume (Cumulative)", showgrid = FALSE, 
        zeroline = FALSE), yaxis2 = list(side = "right", overlaying = "y", title = "Yield %", 
        showgrid = FALSE, zeroline = FALSE, tickformat = "%"), barmode = "stack", 
    title = "Semi-Conductor Production Volume vs. Time (Cumulative)")
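The cumulative-yield transform feeding the second chart (daily totals, running sums, pass fraction) can be sketched in pandas with toy daily counts:

```python
import pandas as pd

daily = pd.DataFrame({
    "MFG_DATE": pd.to_datetime(["2008-07-19", "2008-07-20", "2008-07-21"]),
    "QTY_PASS": [40, 35, 50],
    "QTY_FAIL": [3, 5, 2],
})

daily["QTY"] = daily["QTY_PASS"] + daily["QTY_FAIL"]
cum = daily[["QTY", "QTY_PASS", "QTY_FAIL"]].cumsum()
cum["PERC_PASS"] = cum["QTY_PASS"] / cum["QTY"]  # running yield
```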

Model 1: Logistic Regression

“Logistic Regression” is a generalized linear model commonly used to fit a binary response.

For more information about this algorithm, see here.

Advantages for problem statement:

  • Simple approach due to binary response variable

Disadvantages for problem statement:

  • Filling NAs with column medians introduces error
  • Stepping through to find lowest AIC is computationally expensive
# create copy
df1 <- df_stats

# impute NAs in dataframe with column medians
for (col in names(df1)) {
    # feature columns only (name starts with 'F' and the rest is numeric)
    if ((substring(col, 1, 1) == "F") && !is.na(as.numeric(substring(col, 2)))) {
        df1[is.na(df1[, col]), col] <- median(df1[, col], na.rm = TRUE)
    }
}

# preview dataframe
printWideDataFrame(df1, 15)
  PASS_FAIL      F0      F1       F2        F3     F4  F5       F6     F7
1         1 3030.93 2564.00 2187.733 1411.1265 1.3602 100  97.6133 0.1242
2         1 3095.78 2465.14 2230.422 1463.6606 0.8294 100 102.3433 0.1247
3         0 2932.61 2559.94 2186.411 1698.0172 1.5102 100  95.4878 0.1241
4         1 2988.72 2479.90 2199.033  909.7926 1.3204 100 104.2367 0.1217
5         1 3032.24 2502.87 2233.367 1326.5200 1.5334 100 100.3967 0.1235
6         1 2946.25 2432.84 2233.367 1326.5200 1.5334 100 100.3967 0.1235
      F8      F9     F10    F11      F12 F13 SIL_VENDOR_ccc SIL_VENDOR_iii
1 1.5005  0.0162 -0.0034 0.9455 202.4396   0              0              0
2 1.4966 -0.0005 -0.0148 0.9627 200.5470   0              0              0
3 1.4436  0.0041  0.0013 0.9615 202.0179   0              0              0
4 1.4882 -0.0124 -0.0033 0.9629 201.8482   0              0              0
5 1.5031 -0.0031 -0.0072 0.9569 201.9424   0              0              0
6 1.5287  0.0167  0.0055 0.9699 200.4720   0              0              0
  SIL_VENDOR_fff ADHS_VENDOR_bbb ADHS_VENDOR_aaa SOP_VENDOR_eee
1              0               1               0              1
2              0               0               1              0
3              0               0               1              0
4              0               0               1              1
5              0               1               0              0
6              0               1               0              1
  SOP_VENDOR_aaa SOP_VENDOR_jjj SOP_VENDOR_iii SOP_VENDOR_ddd
1              0              0              0              0
2              1              0              0              0
3              0              1              0              0
4              0              0              0              0
5              0              0              1              0
6              0              0              0              0
  SOP_VENDOR_ccc SOP_VENDOR_kkk SOP_VENDOR_hhh SOP_VENDOR_bbb
1              0              0              0              0
2              0              0              0              0
3              0              0              0              0
4              0              0              0              0
5              0              0              0              0
6              0              0              0              0
  SOP_VENDOR_ggg SOP_VENDOR_fff
1              0              0
2              0              0
3              0              0
4              0              0
5              0              0
6              0              0
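The median-imputation loop above can be expressed more compactly in pandas; a sketch on a toy frame (column names mimic the SAMPLE layout, with one missing F0 reading):

```python
import pandas as pd

df1 = pd.DataFrame({
    "PASS_FAIL": [1, 0, 1, 1],
    "F0": [3030.93, None, 2932.61, 2988.72],
    "SIL_VENDOR_ccc": [0, 1, 0, 0],
})

# feature columns only: name starts with 'F' and the rest is numeric
feature_cols = [c for c in df1.columns if c.startswith("F") and c[1:].isdigit()]
# fill each feature's NAs with that column's median
df1[feature_cols] = df1[feature_cols].fillna(df1[feature_cols].median())
```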
# initialize models
m1_full <- glm(PASS_FAIL ~ ., data = df1, family = binomial(), control = list(maxit = 50))
m1_null <- glm(PASS_FAIL ~ 1, data = df1, family = binomial(), control = list(maxit = 50))

# down-select variables
# m1_bwd <- step(m1_full, direction='backward', trace=0)  # not good for high dimensionality problem
m1_fwd <- step(m1_null, scope = list(lower = m1_null, upper = m1_full), direction = "forward", 
    trace = 0)  # both has lower AIC
m1_both <- step(m1_null, scope = list(upper = m1_full), direction = "both", 
    trace = 0)  # best choice

if (m1_fwd$aic < m1_both$aic) {
    print("Forward direction used")
    m1_base = m1_fwd
} else {
    print("Both directions used")
    m1_base = m1_both
}
[1] "Both directions used"
summary(m1_base)

Call:
glm(formula = PASS_FAIL ~ SIL_VENDOR_eee + F59 + F21 + F73 + 
    F428 + F569 + F573 + F64 + F75 + F129 + F433 + F365 + F9 + 
    F443 + F473 + F500 + F368 + F488 + SOP_VENDOR_ggg + F411 + 
    F476 + F38 + F87 + F104 + F484 + F349 + F84 + F72 + F56 + 
    F554 + F131 + F511 + F545 + F470 + F410 + F419 + F418 + F32 + 
    SIL_VENDOR_ccc + SOP_VENDOR_aaa + F320 + F66 + F321 + F94 + 
    F132 + F490 + F471, family = binomial(), data = df1, control = list(maxit = 50))

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-2.060e-05   2.100e-08   2.100e-08   2.100e-08   5.441e-05  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)     2.129e+05  1.175e+08   0.002    0.999
SIL_VENDOR_eee -6.116e+03  1.447e+06  -0.004    0.997
F59            -1.747e+02  4.303e+04  -0.004    0.997
F21            -8.187e-01  2.936e+02  -0.003    0.998
F73             1.386e+02  5.193e+04   0.003    0.998
F428            1.797e+02  1.448e+05   0.001    0.999
F569           -5.506e+01  2.075e+04  -0.003    0.998
F573            3.469e+03  1.713e+06   0.002    0.998
F64            -2.600e+02  7.166e+04  -0.004    0.997
F75            -1.587e+04  6.899e+06  -0.002    0.998
F129           -3.419e+02  1.112e+05  -0.003    0.998
F433           -1.254e+00  8.671e+02  -0.001    0.999
F365           -1.912e+05  6.726e+07  -0.003    0.998
F9              2.215e+04  9.519e+06   0.002    0.998
F443           -2.783e+03  9.936e+05  -0.003    0.998
F473           -1.859e+01  9.232e+03  -0.002    0.998
F500           -6.861e-01  2.940e+02  -0.002    0.998
F368            2.315e+05  1.188e+08   0.002    0.998
F488            2.967e+00  1.046e+03   0.003    0.998
SOP_VENDOR_ggg  2.065e+03  7.653e+05   0.003    0.998
F411           -3.996e+02  1.510e+05  -0.003    0.998
F476            6.828e+00  1.066e+04   0.001    0.999
F38             5.803e+02  5.785e+05   0.001    0.999
F87            -1.747e+04  1.922e+07  -0.001    0.999
F104            3.076e+05  1.511e+08   0.002    0.998
F484            1.230e+00  9.049e+02   0.001    0.999
F349           -2.574e+04  8.889e+06  -0.003    0.998
F84             5.699e+04  4.048e+07   0.001    0.999
F72            -8.331e+01  5.116e+04  -0.002    0.999
F56            -1.785e+05  6.484e+07  -0.003    0.998
F554           -3.955e+02  1.540e+05  -0.003    0.998
F131           -8.681e+04  6.803e+07  -0.001    0.999
F511           -9.658e-01  6.637e+02  -0.001    0.999
F545            3.812e+02  2.079e+05   0.002    0.999
F470            2.331e+02  9.368e+04   0.002    0.998
F410            2.207e+02  1.078e+05   0.002    0.998
F419           -9.642e-01  5.370e+02  -0.002    0.999
F418            7.056e-01  7.848e+02   0.001    0.999
F32            -1.121e+02  4.765e+04  -0.002    0.998
SIL_VENDOR_ccc -2.036e+03  8.167e+05  -0.002    0.998
SOP_VENDOR_aaa -7.362e+02  5.662e+05  -0.001    0.999
F320           -1.293e+04  4.142e+06  -0.003    0.998
F66            -7.072e+01  5.154e+04  -0.001    0.999
F321            4.860e+02  1.511e+05   0.003    0.997
F94             3.084e+06  1.283e+09   0.002    0.998
F132            4.782e+03  3.003e+06   0.002    0.999
F490            1.922e+01  1.169e+04   0.002    0.999
F471           -4.978e+01  4.420e+04  -0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7.6515e+02  on 1566  degrees of freedom
Residual deviance: 6.6192e-09  on 1519  degrees of freedom
AIC: 96

Number of Fisher Scoring iterations: 50
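The reported AIC of 96 can be sanity-checked by hand. AIC = 2k − 2 ln L̂, and for a binary response the saturated log-likelihood is zero, so −2 ln L̂ equals the residual deviance. With k = 48 coefficients (intercept plus 47 selected predictors) and residual deviance ≈ 6.6e−09, the training fit is essentially perfect and the AIC collapses to 2k:

```python
# AIC = 2k - 2*lnL; for binary logistic regression, -2*lnL = residual deviance
k = 48                  # intercept + 47 selected predictors
deviance = 6.6192e-09   # residual deviance from the summary above
aic = 2 * k + deviance
```

The near-zero deviance together with the huge standard errors in the coefficient table is the classic signature of complete separation, consistent with the convergence warnings noted under Future Work.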

Model 2: Extremely Random Trees

“ExtraTrees” is a variant of Random Forest in which:

  • Each tree is trained on the entire sample rather than a bootstrap sample
  • The tree learner's top-down splitting is randomized

For more information about this algorithm, see section “ExtraTrees” here.

Advantages for problem statement:

  • Handles NAs gracefully
  • Handles noisy data and high dimensionality

Disadvantages for problem statement:

  • Difficult to check model integrity due to complexity
  • Package documentation is brief or missing (in both Python and R)
# create copy
df2 <- df_stats

# preview dataframe
printWideDataFrame(df2, 15)
  PASS_FAIL      F0      F1       F2        F3     F4  F5       F6     F7
1         1 3030.93 2564.00 2187.733 1411.1265 1.3602 100  97.6133 0.1242
2         1 3095.78 2465.14 2230.422 1463.6606 0.8294 100 102.3433 0.1247
3         0 2932.61 2559.94 2186.411 1698.0172 1.5102 100  95.4878 0.1241
4         1 2988.72 2479.90 2199.033  909.7926 1.3204 100 104.2367 0.1217
5         1 3032.24 2502.87 2233.367 1326.5200 1.5334 100 100.3967 0.1235
6         1 2946.25 2432.84 2233.367 1326.5200 1.5334 100 100.3967 0.1235
      F8      F9     F10    F11      F12 F13 SIL_VENDOR_ccc SIL_VENDOR_iii
1 1.5005  0.0162 -0.0034 0.9455 202.4396   0              0              0
2 1.4966 -0.0005 -0.0148 0.9627 200.5470   0              0              0
3 1.4436  0.0041  0.0013 0.9615 202.0179   0              0              0
4 1.4882 -0.0124 -0.0033 0.9629 201.8482   0              0              0
5 1.5031 -0.0031 -0.0072 0.9569 201.9424   0              0              0
6 1.5287  0.0167  0.0055 0.9699 200.4720   0              0              0
  SIL_VENDOR_fff ADHS_VENDOR_bbb ADHS_VENDOR_aaa SOP_VENDOR_eee
1              0               1               0              1
2              0               0               1              0
3              0               0               1              0
4              0               0               1              1
5              0               1               0              0
6              0               1               0              1
  SOP_VENDOR_aaa SOP_VENDOR_jjj SOP_VENDOR_iii SOP_VENDOR_ddd
1              0              0              0              0
2              1              0              0              0
3              0              1              0              0
4              0              0              0              0
5              0              0              1              0
6              0              0              0              0
  SOP_VENDOR_ccc SOP_VENDOR_kkk SOP_VENDOR_hhh SOP_VENDOR_bbb
1              0              0              0              0
2              0              0              0              0
3              0              0              0              0
4              0              0              0              0
5              0              0              0              0
6              0              0              0              0
  SOP_VENDOR_ggg SOP_VENDOR_fff
1              0              0
2              0              0
3              0              0
4              0              0
5              0              0
6              0              0
# run model and summarize
m2_base = extraTrees(df2 %>% select(-PASS_FAIL), df2$PASS_FAIL, numRandomCuts = 1, 
    na.action = "fuse")
m2_base
ExtraTrees:
 - # of trees: 500
 - node size:  5
 - # of dim:   622
 - # of tries: 207
 - type:       numeric (regression)
 - multi-task: no
# extraTrees does not have a variable importance function
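A model-agnostic workaround for the missing importance function is permutation importance: shuffle one feature's values and measure how much a fit metric degrades. A minimal pure-Python sketch; the toy `predict` function and data are hypothetical stand-ins for a fitted model:

```python
import random

def permutation_importance(predict, X, y, col, n_repeats=10, seed=0):
    """Mean accuracy drop after shuffling column `col` of X."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(p == t for p, t in zip(predict(rows), y)) / len(y)

    base = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        Xp = [row[:] for row in X]            # copy rows, then scramble one column
        col_vals = [row[col] for row in Xp]
        rng.shuffle(col_vals)
        for row, v in zip(Xp, col_vals):
            row[col] = v
        drops.append(base - accuracy(Xp))
    return sum(drops) / n_repeats

# hypothetical model: predicts straight from column 0; column 1 is pure noise
predict = lambda rows: [row[0] for row in rows]
y = [i % 2 for i in range(20)]
X = [[label, 0.5] for label in y]

imp0 = permutation_importance(predict, X, y, 0)  # informative feature
imp1 = permutation_importance(predict, X, y, 1)  # noise feature, zero drop
```

The same idea works in R by re-predicting with one scrambled column, with no package support required.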

Cross Validation

Perform 5-fold cross validation for both models and generate ROC graphs here.

# create copy
df <- df_stats

n = 5  # number of folds
df = df[sample(nrow(df)), ]  # Randomly shuffle the data
folds = cut(seq(1, nrow(df)), breaks = n, labels = FALSE)  # Create 5 equally sized folds

# create empty matrix for accuracy and precision
accuracy = as.data.frame(matrix(data = NA, nrow = n, ncol = 2))
precision = as.data.frame(matrix(data = NA, nrow = n, ncol = 2))
names(accuracy) = c("m1", "m2")
names(precision) = c("m1", "m2")
cutoff = 0.5

pdf(file = "./docs/ROCs.pdf", width = 10, height = 7.5)  # begin pdf writer

# Perform 5 fold cross validation
for (i in 1:n) {
    # Segment the data by fold using the which() function
    testIndexes = which(folds == i, arr.ind = TRUE)
    testData = df[testIndexes, ]
    trainData = df[-testIndexes, ]
    
    # model 1: logistic regression
    m1 = glm(m1_both$formula, data = trainData, family = "binomial", control = list(maxit = 50))
    p1 = predict(m1, newdata = testData, type = "response")
    pr1 = prediction(p1, testData$PASS_FAIL)
    prf1 = performance(pr1, measure = "tpr", x.measure = "fpr")
    prec1 = performance(pr1, measure = "prec")
    acc1 = performance(pr1, measure = "acc")
    auc1 = performance(pr1, measure = "auc")
    
    # model 2: extremely random forest
    m2 = extraTrees(trainData %>% select(-PASS_FAIL), trainData$PASS_FAIL, numRandomCuts = 1, 
        na.action = "fuse")
    p2 = predict(m2, testData %>% select(-PASS_FAIL))
    pr2 = prediction(p2, testData$PASS_FAIL)
    prf2 = performance(pr2, measure = "tpr", x.measure = "fpr")
    prec2 = performance(pr2, measure = "prec")
    acc2 = performance(pr2, measure = "acc")
    auc2 = performance(pr2, measure = "auc")
    
    # graph results
    par(pty = "s")
    plot(prf1, main = paste("ROC: Fold ", i, sep = ""), xaxs = "i", yaxs = "i", 
        asp = 1)
    lines(prf2@x.values[[1]], prf2@y.values[[1]], col = "red")
    abline(a = 0, b = 1, lty = 2)
    legend("bottomright", c(paste("Model 1 | AUC=", format(round(auc1@y.values[[1]], 
        3), 3), sep = ""), paste("Model 2 | AUC=", format(round(auc2@y.values[[1]], 
        3), 3), sep = "")), col = c("black", "red"), lty = c(1, 1))
    
    par(pty = "m")
    plot(prec1, main = paste("Precision: Fold ", i, sep = ""), ylim = c(0.4, 
        1))
    lines(prec2@x.values[[1]], prec2@y.values[[1]], col = "red")
    abline(v = 0.5, lty = 2)
    legend("topleft", c("Model 1", "Model 2"), col = c("black", "red"), lty = c(1, 
        1))
    
    plot(acc1, main = paste("Accuracy: Fold ", i, sep = ""), ylim = c(0.4, 1))
    lines(acc2@x.values[[1]], acc2@y.values[[1]], col = "red")
    abline(v = 0.5, lty = 2)
    legend("topleft", c("Model 1", "Model 2"), col = c("black", "red"), lty = c(1, 
        1))
    
    accuracy$m1[i] = acc1@y.values[[1]][max(which(acc1@x.values[[1]] >= cutoff))]
    accuracy$m2[i] = acc2@y.values[[1]][max(which(acc2@x.values[[1]] >= cutoff))]
    
    precision$m1[i] = prec1@y.values[[1]][max(which(prec1@x.values[[1]] >= cutoff))]
    precision$m2[i] = prec2@y.values[[1]][max(which(prec2@x.values[[1]] >= cutoff))]
    
}

invisible(dev.off())  # close pdf writer
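The fold assignment above (shuffle the rows, then `cut` into n contiguous groups) can be sketched in Python to make the mechanics explicit; `make_folds` is an illustrative helper, not part of the original script:

```python
import random

def make_folds(n_rows, n_folds, seed=42):
    """Shuffle row indices, then slice into n_folds near-equal groups."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    return [idx[k * n_rows // n_folds:(k + 1) * n_rows // n_folds]
            for k in range(n_folds)]

folds = make_folds(1567, 5)  # the SECOM table has 1567 rows
```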

Comparing Models

A paired t-test can be used to compare the models' accuracy and precision results.

# defined as null hypothesis: m1-m2=0
accuracy_test = t.test(accuracy$m1, accuracy$m2, conf.level = 0.95, paired = T)
precision_test = t.test(precision$m1, precision$m2, conf.level = 0.95, paired = T)

accuracy
         m1        m2
1 0.9500000 0.9520295
2 0.9207921 0.9295775
3 0.9294118 0.9589552
4 0.9417476 0.9434629
5 0.8936170 0.9444444
accuracy_test

    Paired t-test

data:  accuracy$m1 and accuracy$m2
t = -1.9508, df = 4, p-value = 0.1229
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.045024779  0.007864334
sample estimates:
mean of the differences 
            -0.01858022 
if (accuracy_test$p.value > 0.05) {
    print("Model 1 and Model 2 accuracies are not significantly different.")
} else if (mean(accuracy$m1) > mean(accuracy$m2)) {
    print("Model 1 is statistically more accurate than Model 2.")
} else {
    print("Model 2 is statistically more accurate than Model 1.")
}
[1] "Model 1 and Model 2 accuracies are not significantly different."
precision
         m1        m2
1 0.9680851 0.9723320
2 0.9574468 0.9477612
3 0.9729730 0.9762846
4 0.9787234 0.9626866
5 0.9873418 0.9581882
precision_test

    Paired t-test

data:  precision$m1 and precision$m2
t = 1.5133, df = 4, p-value = 0.2048
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.007899128  0.026826147
sample estimates:
mean of the differences 
             0.00946351 
if (precision_test$p.value > 0.05) {
    print("Model 1 and Model 2 precisions are not significantly different.")
} else if (mean(precision$m1) > mean(precision$m2)) {
    print("Model 1 is statistically more precise than Model 2.")
} else {
    print("Model 2 is statistically more precise than Model 1.")
}
[1] "Model 1 and Model 2 precisions are not significantly different."
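The paired t statistics above can be reproduced by hand from the printed per-fold tables; a stdlib sketch using the accuracy values:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic for H0: mean(a - b) = 0."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# per-fold accuracies from the table above
m1 = [0.9500000, 0.9207921, 0.9294118, 0.9417476, 0.8936170]
m2 = [0.9520295, 0.9295775, 0.9589552, 0.9434629, 0.9444444]
t = paired_t(m1, m2)  # ~ -1.95, matching R's t = -1.9508 with df = 4
```

With df = n − 1 = 4 this gives the reported two-sided p-value of 0.1229, so the null hypothesis of equal accuracy is not rejected.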

Important Variables

The important variables to be exported come from Model 1. Since neither model is significantly different from the other in accuracy or precision, Model 1's variables were chosen because the “ExtraTrees” package lacks documentation on how to extract variable importance from Model 2.

The time series graphs of the important variables can be found here.

options(warn=-1)
pdf(file='./docs/ImportantVariables.pdf',width=10,height=7.5)  # begin pdf writer
for(name in sort(names(m1_base$coefficients)[-1])){
  p <- df_orig %>%
    fastDummies::dummy_cols() %>%  # add dummy variables for all string columns
    mutate(PASS_FAIL = as.factor(ifelse(PASS_FAIL==1,"PASS","FAIL"))) %>%
    ggplot(aes_string(x = "MFG_DATE", y = name)) +
    geom_point(alpha = 0.4, aes(colour = PASS_FAIL)) +
    scale_color_manual(values = c("FAIL" = "red", "PASS" = "black")) +
    guides(colour = guide_legend(override.aes = list(alpha = 1, size = 2))) +
    ggtitle(paste("Variable:",name)) +
    xlab("Manufacturing Date") +
    ylab("Value")
  print(p)
}
invisible(dev.off())
options(warn=0)

Future Work

  • Remove all convergence warnings from models
  • Penalize models for over-fitting
  • Tune existing model parameters
  • Research more models for high-dimensionality, noisy data with few observations
  • Create an enterprise level solution
  • Qualitatively investigate all sensors/variables identified as driving failures
  • Seek additional relevant data to aid classification

Shane Lanan

January 18, 2019